MemPool: A Scalable Manycore Architecture with a Low-Latency Shared L1 Memory
Shared L1 memory clusters are a common architectural pattern (e.g., in
GPGPUs) for building efficient and flexible multi-processing-element (PE)
engines. However, it is a common belief that these tightly-coupled clusters
would not scale beyond a few tens of PEs. In this work, we tackle scaling
shared L1 clusters to hundreds of PEs while supporting a flexible and
productive programming model and maintaining high efficiency. We present
MemPool, a manycore system with 256 RV32IMAXpulpimg "Snitch" cores featuring
application-tunable functional units. We designed and implemented an efficient
low-latency PE to L1-memory interconnect, an optimized instruction path to
ensure each PE's independent execution, and a powerful DMA engine and system
interconnect to stream data in and out. MemPool is easy to program, with all
the cores sharing a global view of a large, multi-banked, L1 scratchpad memory,
accessible within at most five cycles in the absence of conflicts. We provide
multiple runtimes to program MemPool at different abstraction levels and
illustrate its versatility with a wide set of applications. MemPool runs at 600
MHz (60 gate delays) in typical conditions (TT/0.80 V/25 °C) in 22 nm FDX
technology and achieves a performance of up to 229 GOPS or 192 GOPS/W with less
than 2% execution stalls. Comment: 14 pages, 17 figures, 2 tables.
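To make the shared-L1 programming model described above concrete, the following is a minimal sketch of a fork-join style kernel in C. The helpers mempool_core_id(), mempool_num_cores(), and mempool_barrier() are hypothetical stand-ins for the runtime primitives MemPool provides, not the actual API.

```c
/* Minimal sketch of a shared-L1 fork-join kernel (hypothetical runtime API).
 * Every core sees the same multi-banked L1 scratchpad, so a kernel can be
 * expressed as "each core works on its own slice, then all cores barrier". */
#include <stdint.h>

/* Hypothetical runtime helpers -- illustrative stand-ins only. */
extern uint32_t mempool_core_id(void);   /* ID of the calling core     */
extern uint32_t mempool_num_cores(void); /* total cores in the cluster */
extern void     mempool_barrier(void);   /* cluster-wide barrier       */

/* Element-wise vector add over buffers living in the shared L1. */
void vec_add_parallel(const int32_t *a, const int32_t *b,
                      int32_t *c, uint32_t n) {
  uint32_t id    = mempool_core_id();
  uint32_t cores = mempool_num_cores();
  /* Interleaved (cyclic) distribution maps neighboring elements to
   * different cores and hence to different L1 banks, reducing conflicts. */
  for (uint32_t i = id; i < n; i += cores) {
    c[i] = a[i] + b[i];
  }
  mempool_barrier(); /* all cores see the finished result after this point */
}
```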
LRSCwait: Enabling Scalable and Efficient Synchronization in Manycore Systems through Polling-Free and Retry-Free Operation
Extensive polling in shared-memory manycore systems can lead to contention,
decreased throughput, and poor energy efficiency. Both lock implementations and
the general-purpose atomic operation, load-reserved/store-conditional (LRSC),
cause polling due to serialization and retries. To alleviate this overhead, we
propose LRwait and SCwait, a synchronization pair that eliminates polling by
allowing contending cores to sleep while waiting for previous cores to finish
their atomic access. As a scalable implementation of LRwait, we present
Colibri, a distributed and scalable approach to managing LRwait reservations.
Through extensive benchmarking on an open-source RISC-V platform with 256
cores, we demonstrate that Colibri outperforms current synchronization
approaches for various concurrent algorithms with high and low contention
regarding throughput, fairness, and energy efficiency. With an area overhead of
only 6%, Colibri outperforms LRSC-based implementations by a factor of 6.5x in
terms of throughput and 7.1x in terms of energy efficiency. Comment: 6 pages, 6 figures, 2 tables, accepted as a regular paper at DATE 2024.
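For reference, the sketch below shows the conventional LR/SC polling pattern the paper sets out to eliminate: a spinlock acquire in C with RISC-V inline assembly, where a failed store-conditional sends the core back into a retry loop. The helper name and the omission of acquire/release ordering are illustrative choices, not taken from the paper; with LRwait/SCwait as proposed above, a contending core would instead register its reservation and sleep until the previous core finishes its atomic access.

```c
/* Sketch of the polling that LRwait/SCwait aim to eliminate. With classic
 * RISC-V LR/SC, contention makes the store-conditional fail and the core
 * retries, generating interconnect traffic and wasting energy. */
#include <stdint.h>

/* Spin until the lock word transitions 0 -> 1 (conventional LR/SC polling).
 * Memory-ordering suffixes (.aq/.rl) omitted for brevity. */
static inline void lrsc_lock_acquire(volatile uint32_t *lock) {
  uint32_t tmp;
  do {
    __asm__ volatile(
        "lr.w    %0, (%1)      \n"  /* load-reserved the lock word         */
        "bnez    %0, 1f        \n"  /* already taken: give up this round   */
        "li      %0, 1         \n"
        "sc.w    %0, %0, (%1)  \n"  /* try to claim it; %0 != 0 on failure */
        "1:                    \n"
        : "=&r"(tmp)
        : "r"(lock)
        : "memory");
  } while (tmp != 0);  /* retry loop = polling traffic under contention */
}
```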
Spatz: A Compact Vector Processing Unit for High-Performance and Energy-Efficient Shared-L1 Clusters
While parallel architectures based on clusters of Processing Elements (PEs)
sharing L1 memory are widespread, there is no consensus on how lean their PE
should be. Architecting PEs as vector processors holds the promise to greatly
reduce their instruction fetch bandwidth, mitigating the Von Neumann Bottleneck
(VNB). However, due to their historical association with supercomputers,
classical vector machines include micro-architectural tricks to improve the
Instruction Level Parallelism (ILP), which increases their instruction fetch
and decode energy overhead. In this paper, we explore for the first time vector
processing as an option to build small and efficient PEs for large-scale
shared-L1 clusters. We propose Spatz, a compact, modular 32-bit vector
processing unit based on the integer embedded subset of the RISC-V Vector
Extension version 1.0. A Spatz-based cluster with four Multiply-Accumulate
Units (MACUs) needs only 7.9 pJ per 32-bit integer multiply-accumulate
operation, 40% less energy than an equivalent cluster built with four Snitch
scalar cores. We analyzed Spatz' performance by integrating it within MemPool,
a large-scale many-core shared-L1 cluster. The Spatz-based MemPool system
achieves up to 285 GOPS when running a 256x256 32-bit integer matrix
multiplication, 70% more than the equivalent Snitch-based MemPool system. In
terms of energy efficiency, the Spatz-based MemPool system achieves up to 266
GOPS/W when running the same kernel, more than twice the energy efficiency of
the Snitch-based MemPool system, which reaches 128 GOPS/W. Those results show
the viability of lean vector processors as high-performance and
energy-efficient PEs for large-scale clusters with tightly-coupled L1 memory. Comment: 9 pages. Accepted for publication in the 2022 International Conference on Computer-Aided Design (ICCAD 2022).
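As an illustration of the lean vector-PE idea, here is a strip-mined 32-bit integer AXPY kernel in C with RVV 1.0 inline assembly (vsetvli, vle32.v, vmacc.vx, vse32.v), the style of code a Spatz-class unit would execute with a single instruction stream driving many operations. It is a generic RVV sketch under those assumptions, not a kernel taken from the paper.

```c
/* Strip-mined 32-bit integer "y += a * x" using RVV 1.0 instructions.
 * One vmacc.vx amortizes fetch/decode over a whole vector of MACs,
 * which is the VNB-mitigation argument made above. */
#include <stddef.h>
#include <stdint.h>

void axpy_i32_rvv(size_t n, int32_t a, const int32_t *x, int32_t *y) {
  while (n > 0) {
    size_t vl;
    __asm__ volatile(
        "vsetvli %0, %1, e32, m1, ta, ma \n"  /* vl = min(n, elements/reg) */
        "vle32.v  v0, (%2)               \n"  /* v0 = x[0..vl)             */
        "vle32.v  v1, (%3)               \n"  /* v1 = y[0..vl)             */
        "vmacc.vx v1, %4, v0             \n"  /* v1 += a * v0              */
        "vse32.v  v1, (%3)               \n"  /* store back to y           */
        : "=&r"(vl)
        : "r"(n), "r"(x), "r"(y), "r"(a)
        : "v0", "v1", "memory");
    n -= vl;
    x += vl;
    y += vl;
  }
}
```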
Fast Shared-Memory Barrier Synchronization for a 1024-Cores RISC-V Many-Core Cluster
Synchronization is likely the most critical performance killer in
shared-memory parallel programs. With the rise of multi-core and many-core
processors, the relative impact on performance and energy overhead of
synchronization is bound to grow. This paper focuses on barrier synchronization
for TeraPool, a cluster of 1024 RISC-V processors with non-uniform memory
access to a tightly coupled 4MB shared L1 data memory. We compare the
synchronization strategies available in other multi-core and many-core clusters
to identify the optimal native barrier kernel for TeraPool. We benchmark a set
of optimized barrier implementations and evaluate their performance in the
framework of the widespread fork-join OpenMP-style programming model. We test
parallel kernels from the signal-processing and telecommunications domain,
achieving less than 10% synchronization overhead over the total runtime for
problems that fit TeraPool's L1 memory. By fine-tuning our tree barriers, we
achieve 1.6x speed-up with respect to a naive central counter barrier and just
6.2% overhead on a typical 5G application, including a challenging multistage
synchronization kernel. To our knowledge, this is the first work where
shared-memory barriers are used for the synchronization of a thousand
processing elements tightly coupled to shared data memory. Comment: 15 pages, 7 figures.
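To make the baseline explicit, the sketch below gives a naive central-counter barrier in portable C11 atomics (not TeraPool's actual runtime): every core increments one shared counter and spins on a generation flag, so that single location becomes the contention hot spot that the tuned tree barriers avoid by splitting it into a hierarchy of per-group counters.

```c
/* Naive central-counter barrier: the reference point the tree barriers beat.
 * All cores hammer one counter and poll one flag; at ~1024 cores this
 * serializes on a single memory location. */
#include <stdatomic.h>
#include <stdint.h>

typedef struct {
  atomic_uint arrived;    /* cores that reached the barrier this round */
  atomic_uint generation; /* bumped by the last arriving core          */
  uint32_t    num_cores;
} central_barrier_t;

void barrier_wait(central_barrier_t *b) {
  uint32_t gen = atomic_load(&b->generation);
  if (atomic_fetch_add(&b->arrived, 1) + 1 == b->num_cores) {
    /* Last core: reset the counter, then release everyone else. */
    atomic_store(&b->arrived, 0);
    atomic_fetch_add(&b->generation, 1);
  } else {
    /* Everyone else polls the generation flag (the costly part at scale). */
    while (atomic_load(&b->generation) == gen) { /* spin */ }
  }
}
```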
The Solar Neighborhood. XXXIV. A Search for Planets Orbiting Nearby M Dwarfs using Astrometry
Astrometric measurements are presented for seven nearby stars with previously
detected planets: six M dwarfs (GJ 317, GJ 667C, GJ 581, GJ 849, GJ 876, and GJ
1214) and one K dwarf (BD-10 3166). Measurements are also presented for six
additional nearby M dwarfs without known planets, but which are more favorable
to astrometric detections of low mass companions, as well as three binary
systems for which we provide astrometric orbit solutions. Observations have
baselines of three to thirteen years, and were made as part of the RECONS
long-term astrometry and photometry program at the CTIO/SMARTS 0.9m telescope.
We provide trigonometric parallaxes and proper motions for all 16 systems, and
perform an extensive analysis of the astrometric residuals to determine the
minimum detectable companion mass for the 12 M dwarfs not having close stellar
secondaries. For the six M dwarfs with known planets, we are not sensitive to
planets, but can rule out the presence of all but the least massive brown
dwarfs at periods of 2 - 12 years. For the six more astrometrically favorable M
dwarfs, we conclude that none have brown dwarf companions, and are sensitive to
companions with masses as low as 1 Jupiter mass for periods longer than two years.
In particular, we conclude that Proxima Centauri has no Jovian companions at
orbital periods of 2 - 12 years. These results complement previously published
M dwarf planet occurrence rates by providing astrometrically determined upper
mass limits on potential super-Jupiter companions at orbits of two years and
longer. As part of a continuing survey, these results are consistent with the
paucity of super-Jupiter and brown dwarf companions we find among the over 250
red dwarfs within 25 pc observed longer than five years in our astrometric
program. Comment: 18 pages, 5 figures, 4 tables, accepted for publication in AJ.
A High-performance, Energy-efficient Modular DMA Engine Architecture
Data transfers are essential in today's computing systems as latency and
complex memory access patterns are increasingly challenging to manage. Direct
memory access engines (DMAEs) are critically needed to transfer data
independently of the processing elements, hiding latency and achieving high
throughput even for complex access patterns to high-latency memory. With the
prevalence of heterogeneous systems, DMAEs must operate efficiently in
increasingly diverse environments. This work proposes a modular and highly
configurable open-source DMAE architecture called intelligent DMA (iDMA), split
into three parts that can be composed and customized independently. The
front-end implements the control plane binding to the surrounding system. The
mid-end accelerates complex data transfer patterns such as multi-dimensional
transfers, scattering, or gathering. The back-end interfaces with the on-chip
communication fabric (data plane). We assess the efficiency of iDMA in various
instantiations: In high-performance systems, we achieve speedups of up to 15.8x
with only 1% additional area compared to a base system without a DMAE. We
achieve an area reduction of 10% while improving ML inference performance by
23% in ultra-low-energy edge AI systems over an existing DMAE solution. We
provide area, timing, latency, and performance characterization to guide its
instantiation in various systems. Comment: 14 pages, 14 figures, accepted by an IEEE journal for publication.
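As a rough illustration of what the mid-end's multi-dimensional transfer support buys, the sketch below describes a strided 2D copy through a descriptor and launch helper. The struct layout and function names (dma_2d_req_t, dma_2d_start, dma_wait) are invented for illustration and are not iDMA's real register map or driver API.

```c
/* Illustrative 2D (strided) transfer request. Conceptually, a front-end
 * accepts such a request, a mid-end unrolls it into repeated 1D transfers,
 * and a back-end drives the on-chip fabric. Hypothetical API. */
#include <stdint.h>

typedef struct {
  uint64_t src;        /* source base address                        */
  uint64_t dst;        /* destination base address                   */
  uint32_t size;       /* bytes per inner (1D) transfer              */
  uint32_t src_stride; /* bytes between consecutive source rows      */
  uint32_t dst_stride; /* bytes between consecutive destination rows */
  uint32_t num_reps;   /* number of rows (outer repetitions)         */
} dma_2d_req_t;

extern uint32_t dma_2d_start(const dma_2d_req_t *req); /* returns transfer ID */
extern void     dma_wait(uint32_t id);                  /* block until done   */

/* Example: copy a 64x64 tile of 32-bit words out of a 1024-word-wide frame. */
void copy_tile(uint64_t src, uint64_t dst) {
  dma_2d_req_t req = {
    .src = src, .dst = dst,
    .size       = 64 * 4,    /* one row of the tile            */
    .src_stride = 1024 * 4,  /* row pitch of the source frame  */
    .dst_stride = 64 * 4,    /* tile rows stored back-to-back  */
    .num_reps   = 64,
  };
  dma_wait(dma_2d_start(&req));
}
```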
MemPool-3D: Boosting Performance and Efficiency of Shared-L1 Memory Many-Core Clusters with 3D Integration
Three-dimensional integrated circuits promise power, performance, and
footprint gains compared to their 2D counterparts, thanks to drastic reductions
in the interconnects' length through their smaller form factor. We can leverage
the potential of 3D integration by enhancing MemPool, an open-source many-core
design with 256 cores and a shared pool of L1 scratchpad memory connected with
a low-latency interconnect. MemPool's baseline 2D design is severely limited by
routing congestion and wire propagation delay, making the design ideal for 3D
integration. In architectural terms, we increase MemPool's scratchpad memory
capacity beyond the sweet spot for 2D designs, improving performance in a
common digital signal processing kernel. We propose a 3D MemPool design that
leverages a smart partitioning of the memory resources across two layers to
balance the size and utilization of the stacked dies. In this paper, we explore
the architectural and the technology parameter spaces by analyzing the power,
performance, area, and energy efficiency of MemPool instances in 2D and 3D with
1 MiB, 2 MiB, 4 MiB, and 8 MiB of scratchpad memory in a commercial 28 nm
technology node. We observe a performance gain of 9.1% when running a matrix
multiplication on the MemPool-3D design with 4 MiB of scratchpad memory
compared to the MemPool 2D counterpart. In terms of energy efficiency, we can
implement the MemPool-3D instance with 4 MiB of L1 memory on an energy budget
15% smaller than its 2D counterpart, and even 3.7% smaller than the MemPool-2D
instance with one-fourth of the L1 scratchpad memory capacity. Comment: Accepted for publication in DATE 2022 -- Design, Automation and Test in Europe Conference.